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Abstract 

Template matching by means of cross-correlation is common practice in pattern recognition. However, 
its sensitivity to deformations of the pattern and the broad and unsharp peaks it produces are signifi- 
cant drawbacks. This paper reviews some results on how these shortcomings can be removed. Several 
techniques (Matched Spatial Filters, Synthetic Discriminant Functions, Principal Components Projections 
and Reconstruction Residuals) are reviewed and compared on a common task: locating eyes in a database 
of faces. New variants are also proposed and compared: least squares Discriminant Functions and the 
combined use of projections on eigenfunctions and the corresponding reconstruction residuals. Finally, 
approximation networks are introduced in an attempt to improve filter design by the introduction of 
nonlinearity. 
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1 Introduction 

The detection and recognition of objects from their im- 
ages, irrespective of their orientation, scale, and view, 
is a very important research subject in computer vision, 
if not computer vision itself. Several techniques have 
been proposed in the past to solve this challenging prob- 
lem. In this paper we will focus on a subset of these 
techniques, those employing the idea of projection to 
match image patterns. The notion of Matched Spatial 
Filter (hereafter MSF) is a venerable one with a long 
history [21]. While by itself it cannot account for invari- 
ant recognition, it can be coupled to invariant mappings 
or signal expansions, and is therefore able to provide 
invariance to rotation and scaling in the image plane. 
In order to cope with more general variations of the ob- 
jects views more sophisticated approaches have to be em- 
ployed. Among them, the use of Synthetic Discriminant 
Functions [17, 9, 6, 14, 15, 20, 28, 19, 8, 7, 26, 18, 16] is 
one of the more promising so far developed. In these 
paper we will follow a path from MSF, to expansion 
matching through different variant of SDFs. The first 
section describes the basic properties of MSF, their op- 
timality and their relation to the probability of misclas- 
sification. The generalization of MSF to a linear com- 
bination of example images is introduced next. Several 
shortcomings of the basic approach are outlined and a 
set of possible solutions is presented in the subsequent 
section. We discuss a relation of the resulting class of fil- 
ters to nonorthogonal image expansion. A generalization 
to projections on multiple directions and the use of the 
projection residual for pattern matching is then investi- 
gated [24, 22, 27, 29, 30, 31]. Finally, a more powerful, 
non linear framework is introduced in which template 
matching can be looked at as a problem of function ap- 
proximation. Network architectures and training strate- 
gies are proposed within this new general framework. 

2 Matched Spatial Filter 

Template matching is extensively used in low-level vision 
tasks to localize and identify patterns in images. Two 
methods are commonly used: 

1. image subtraction: images are considered as vec- 
tors and the norm of their difference is considered 
as a measure of dissimilarity; 

2. correlation: the dot product of two images is con- 
sidered as a measure of their similarity (it repre- 
sents the angle between the images when they are 
suitably normalized and considered as vectors). 

When the images are normalized to have zero average 
and unit norm, the two approaches give the same re- 
sult. The usual implementation of the above methods 
relies on the euclidean distance. Other distances can 
be used and some of them have better properties such 
as increased robustness to noise and minor deformations 
[4]. The next sections are mainly concerned with the 
correlation approach. The idea of image subtraction is 
introduced again in the more general nonlinear frame- 
work. 



2.1 Optimality 

One of the reasons for which template matching by corre- 
lation is commonly used is that correlation can be shown 
to be the optimal (according to a particular criterion) 
linear operation by which a deterministic reference func- 
tion can be extracted from additive white Gaussian noise 
[21]. Let the detected signal be: 



g(x) — <f)(x) + \{x) 



(i) 



where <f)(x) is the original signal and \(x) is noise with 
power spectrum S(lo). The noise is assumed to be wide- 
sense stationary with zero average so that: 

E{\{x)} = 
E{\(x + a)\(x)} = R(a) 

We assume that <f)(x) is known and we want to establish 
its presence and location. To do so we apply to the 
process g{x) a linear filter with impulse response h(x) 
and system function H(lo). The resulting output is: 



z(x) — g(x) * h(x) — i g(x — a)h(a)da (2) 

J — oo 

= z${x) + z\(x) (3) 

Using the convolution theorem for the Fourier transform 
we have that: 



z<j>(x) 



1 

2^ 



(x — a)h(a)da 

o 

$(u)H(u)e iux du 



We want to find H(u>) so as to maximize the following 
signal to noise ratio (SNR): 



\z4,(x )\ 2 
E{z* x (xo)} 



(4) 



where xq is the location of the signal. The SNR repre- 
sents the ratio of the filter responses at the uncorrupted 
signal and at the noise. It is defined at the true location 
of the signal (usually the correlation peak) therefore not 
taking into account the off-peak response of the filter. 
Two cases of particular interest are those of white and 
colored noise: 

White Noise This type of noise is defined by the 
following condition 

S(co) = So 

which corresponds to a flat energy spectrum. The 
Schwartz inequality states that 



f(x)g(x)dx\ 2 < I l/MI 2 / \g(x)\ 2 



and the equality hold iff f(x) = kg(x) (we use ~ to rep- 
resent complex conjugation). This implies the following 
bound for the signal to noise ratio r: 

2 / \®(u)e iu}XQ \ 2 dco f \H(co)\ 2 dio 



2irS f\H(Lo)\ 2 dLo 



and then 



where 



2 E§ 
r < 



So 



Z7T J 



represents the energy of the signal. From the Schwartz 
inequality the equality holds only if 

H(lo) = k$(Lo)e- iujXo 

The spatial domain version of the filter is simply the 
mirror image of the signal: 

h(x) — k<f)(xo — x) 

which implies that the convolution of the signal with the 
filter can be expressed as the cross-correlation with the 
signal (hence the name Matched Spatial Filter). 

Colored Noise If the noise has a non flat spectrum 
S(lo) it is said to be colored. In this case the following 
holds: 



2irz {f> (x ) = f ^(io)H(co)e iujx dio 

|2^(x )| 2 = | /^P^y^H^He-^l 2 

J y/S(u)) 

^ J S(u>)\H(u>)\ 2 du> 



< i m^-\ 2 duo 



hence 



o 1 f \<$>(Lo)e iu)X \ 2 , 
r 2 < — / ' V, , ' dco 



2tt J S(lo) 
with equality holding only when 



^S{lo)H{lo) = k 



Vs&) 



The main consequence of the color of noise is that the 
optimal filter corresponds to a modified version of the 
signal 

which emphasizes the frequencies where the energy of the 
noise is smaller. The optimal filter can also be considered 
as a cascade of a whitening filter S* -1 ' 2 (i<;) and the usual 
filter based on the transformed signal. 

In the spatial domain, correlation amounts to project- 
ing the signal g(x) onto the available template <j>(x). If 
the norm of the projected signal is not equal to that of 
the template, the value of the projection can be mean- 
ingless as the projection value can be large without im- 
plying that the two vectors are close in any reasonable 
sense. The solution is to compute the projection using 
normalized vectors. In particular, if versors are used, 
computing the projection amounts to computing the co- 
sine of the angle formed by the two vectors, which is 
an effective measure of similarity. In vision tasks, vec- 
tor normalization corresponds to adjusting the intensity 
scale so that the corresponding distribution has a given 



variance. Another useful normalization is to set the av- 
erage value of the vector coordinates to zero. This oper- 
ation corresponds to setting the average of the intensity 
distribution for images. These normalization are par- 
ticularly useful when modern cameras are used, as they 
usually operate with automatic gain level (acting on the 
scale of the intensity) and black level adjustment (acting 
as on offset on the intensity distribution). 

2.2 Distorted Templates 

The previous analysis was focused on the detection of a 
deterministic signal corrupted by noise. An interesting 
extension is the detection of a signal belonging to a given 
distribution of signals [17]. As an example, consider the 
problem of locating the eyes in a face image. We do 
not know who's face it is so that we cannot select the 
corresponding signal (the eyes of that person). A whole 
set of different eyes could be available, possibly including 
the correct ones. 

Let {<f)(x)} denote the class of signals to be detected. 
We want to find the filter h which maximizes the SNR 
r 2 over the class of signals {(f)(x)}. The input signal <j>{x) 
can be modeled as a sample realization of the stochastic 
process {<f)(x)}. The ensemble-average correlation func- 
tion of the stochastic process is defined by 



K H (x,y) = E^ix^iy)} 



(5) 



and represents the average over the ensemble of signals 
(and not over the coordinates of a signal). What we 
want to maximize is the ensemble average of the signal 
to noise ratio: 



E*{r 2 } 



£{M^o)| 2 } 
E{zh(xo)} 



(6) 



Assume, without loss of generality, that xq = 0. The 
average SNR can then be rewritten as: 



E^r 2 } 



f f h(-x)h(-y)K (t><t> (x,y)dxdy 
f f h(-x)h(-y)Kx\(x,y)dxdy 



(7) 



where the ensemble autocorrelation function of the sig- 
nal and noise have been used. The autocorrelation func- 
tion of the white noise is proportional to a Dirac delta 
function: 

K xx (x,y) = N6(x-y) (8) 

so that the average signal to noise ratio can be rewritten 
as: 

r 2 X 1 1 h{-x)h{-y)K H {x,y)dxdy 

E ' {r } = Nfh(-xy d x (9) 

Pre-whitening operators can be applied as preprocessing 
functions when the assumption of white noise does not 
hold. The denominator of the RHS in eqn. (9) represents 
the energy of the filter and we can require it to be 1: 



h(—x) dx — 1 



(10) 



To optimize eqn. (9) we must maximize the numerator 
subject to the energy constraint of the filter. The ensem- 
ble auto-correlation function can be expressed in terms 




Figure 1: The probability of error, represented by the 
shaded area, when the distributions are Gaussian with 
the same covariance. 



of the orthonormal eigenfunctions of the integral kernel 



K H (x,y) = ^2\iipi(x)ipi(y) 



(11) 



where the A z - are the corresponding eigenvalues. The 
filter function h can also be expanded in the same basis 



h(-x) = y^i^(ff) 



(12) 



Using the inner product notation and the orthonormality 
of the ipi(x) we can state the optimization problem as 
finding 

arg max } Xi(h • tbA (13) 

y ^ ( =i^ 

If we order the eigenvalues so that Ai > A2 > . . . > A& > 
. . ., we have 

N-E^r 2 } = J2Hh-A) 2 = Y. XiUj2 i ^ ^Y.^ = Al 

i i i 

and the maximum value is achieved when the filter func- 
tion is taken to be the dominant eigenvector. 

2.3 Signal-to-noise ratio and classification error 

Several performance metrics are available for correlation 
filters that describe attributes of the correlation plane. 
The signal to noise ratio (SNR) is just one of them. 
Other useful quantities are the peak-to-correlation en- 
ergy, the location of the correlation peak and the in- 
variance to distortion. As correlation is typically used 
to locate and discriminate objects, another important 
measure of a filter's performance is how well it discrim- 
inates between different classes of objects. The simplest 
case is given by the discrimination between the signal 
and the noise. In this section we will show [16, 12] that 
for the classical matched filter maximizing the SNR is 
equivalent to minimizing the probability of classification 
error P e when the underlying probability distribution 
functions (PDFs) are Gaussians. 

The classifier which minimizes the probability of er- 
ror is the Bayes classifier. For two normal distributions, 
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Figure 2: The cross-correlation of the template reported 
on the right. Note the diffuse shape of the peak that 
makes its localization difficult 



the Bayes decision rule can be expressed as a quadratic 
function of the observation vector x as 

-(x - m A ) T ^A 1 ( x ~ m ^) ~ 

-(x - m B ) T ^B 1 ( x ~ m B) + 



(15) 



B 



Pb 



where mA,TriB are the distribution means, E^,E.b the 
covariance matrices and Pa,Pb the occurence probabil- 
ities. 

Let us consider two classes: a deterministic signal <fi 
corrupted with white Gaussian noise as class A and the 
noise itself as class B. In this case ijia — 4>, tub — 
and Yja — ^b — I- This means that the components 
of the signal <fi are uncorrelated and have unit variance. 
If we further assume that the a priori probabilities of 
occurence of these classes are equal, the probability of 
error (see also Figure 1) is given by: 



1 f°° 
P e = -= / exp(-u 2 /2)di 



(16) 



where r\ — t^ 1 ' 2 , with £ being the Mahalanobis distance 
between the PDFs of the two classes: 

i - (m A - m B ) T I{m A - m B ) - <j> T 4> (17) 
and the Bayes decision rule simplifies to: 

x <E A if cj> T x > ^ 



x e B if<l> T x < ^ 



(18) 
(19) 



The input vector x is then classified as signal or noise 
depending on the value of the correlation with the uncor- 
rupted signal. We have already shown that correlation 
with the signal maximizes the signal to noise ratio, so 
when the noise distribution is Gaussian maximizing the 
SNR is equivalent to minimizing the classification error 
probability. When the noise is not white, the signal can 
be transformed by applying a whitening transformation 
A: 

A T ZA = I (20) 

and the previous reasoning can be applied. 



3 Synthetic Discriminant Functions 

While correlators are optimal for the recognition of pat- 
terns in the presence of white noise they have three major 
limitations: the output of the correlation peak degrades 
rapidly with geometric image distortions, the peak is of- 
ten broad (see Figure 2), making its detection difficult, 
and they cannot be used for multiclass pattern recog- 
nition. It has been noted that one can obtain better 
performance from a multiple correlator (i.e. one com- 
puting the correlation with several templates) by form- 
ing a linear combination of the resulting outputs instead 
of, for example, taking the maximum value [10, 11]. 
The filter synthesis technique known as Synthetic Dis- 
criminant Functions (hereafter SDF) starts from this 
observation and builds a filter as a linear combination 
of MSFs for different patterns [9, 6]. The coefficients 
of the linear combination are chosen to satisfy a set of 
constraints on the filter output, requiring a given value 
for each of the patterns used in the filter synthesis. By 
forcing the filter output to different values for different 
patterns, multiclass pattern recognition can be achieved. 
Let {<f>i(x)}i=i } ... in be a set of (linearly independent) im- 
ages and u — {i/i, . . . , u n } T be a vector representing the 
required output of the filter for each of the images: 



Generalized Matched Filters 



<j )i (g) h — Ui 



(21) 



where (x) represents correlation (not convolution). The 
filter h can be expressed as a linear combination of the 
images </> z -: 

h{x)= J2 h M x ) ( 22 ) 

i = 1 , . . . , n 

as any additional contribution from the space orthogo- 
nal to the images would yield a zero contribution when 
correlating with the image set. If we denote by X the 
matrix whose columns represent the images (represented 
as vectors by concatenating their rows), by enforcing the 
constraints we obtain the following set of equations: 



b = (x T xy 



(23) 



which can be solved as the images are linearly inde- 
pendent. The resulting filter is appropriate for pattern 
recognition applications in which the input object can 
be a member of several classes and different distorted 
versions of the same object (or different objects) can be 
expected within each class. If M is the number of classes, 
Hi is the number of different pattern within each class i, 
TV the overall number of patterns, M filters can be built 
by solving 



h = {X T X)- 1 6 i 



1,...,M 



where 



$ik 



i e;: 

otherwise 



< * < £■= 



(24) 



(25) 



k = 1, . . . , TV and image <f>k belongs to class i if 6^ — 1. 
Discrimination of different classes can be obtained also 
using a single filter and imposing different output values. 
However the performance of such a filter is expected to 
be inferior to that of a set of class specific filters due 
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Figure 3: An increasing portion of a set of 30 eyes images 
was used to build a SDF, an average MSF or a set of 
prototype MSFs from which the highest response was 
extracted. Our new least square SDF uses four building 
templates. The plot report the average responses over 
a disjoint 30 image test set. Note that the lower values 
of MSFs are due to the fact that their response is not 
scaled to obtained a predefined value as opposed to SDFs 
whose output is constrained to be 1, and to approximate 
1 for ls SDFs. 



to the high number of constraints imposed on the filter 
outputs [9]. While this approach makes it easy to ob- 
tain predefined values on a given set of patterns it does 
not allow to control the off-peak filter response. This 
can prevent reliable classification when the number of 
constraints becomes large. 

The effect of filter clutter can also appear in the con- 
struction of a filter giving a fixed response over a set of 
images belonging to the same class (the Equal Correla- 
tion Filter introduced in [9]). 

In order to minimize this problem we propose a new 
variant of SDFs: least quares SDFs. These filters are 
computed using only a subset of the training images 1 
and the coefficient of the linear combination are chosen 
to minimize the square error of the filter output on all of 
the available images. In this case the matrix R — X X 
is rectangular and the estimate of the b relies on the 
computation of the pseudoinverse of R: 



i?t = (R T R)-itf 



(26) 



The dimension of the matrix to be inverted isnxn where 
n represents the number of images used to build the fil- 
ter and not the (greater) number of training images. By 
using a reduced number of building templates the prob- 
lem of filter cluttering is reduced. A different use of least 
square estimation for filter synthesis can be found in [6] 
where it is coupled to Karhunen-Loeve expansion for the 
construction of correlation SDFs. 



The subset of training images can be chosen in a variety 
of ways. In the reported experiments they were chosen at 
random. Another possibility is that of clustering the available 
images, the number of clusters being equal to the number of 
images used in filter synthesis. 




Figure 4: The MSFs resulting from using 20 building 
images in the SDF (left) and 2 in the least square SDF 
(right) when using the same set of training images. The 
difference in contrast of the two images reflect the mag- 
nitude of the MSFs. The performance of the two filters 
was similar. 



The results for a sample application are reported in 
Figure 3. Note that by using a least square estimate a 
good performance can be achieved using a small number 
of templates. This has a major influence on the appear- 
ance of the resulting MSF as can be seen in Figure 4. 

Another variant is to use symbolic encoded filters [9]. 
In this case a set of k filters is built whose outputs are 
or 1 and can be used to encode the different patterns 
using a binary code. In order to use the filter for clas- 
sification, the outputs are thresholded and the resulting 
binary number is used to index the pattern class. 

Synthesis of the MSF from a projection SDF algo- 
rithm can achieve distortion invariance and retain shift 
invariance. However, the resulting filter cannot prevent 
large sidelobe levels from occurring in the correlation 
plane for the case of false (or true) targets. The next 
section will detail the construction of filters which guar- 
antee controlled sharp peaks and good noise immunity. 

4 Advanced SDFs 

The signal to noise ratio maximized by the MSF is lim- 
ited to the correlation peak: it does not take into account 
the off-peak response and the resulting filters often ex- 
hibit a sustained response well apart from the location 
of the central peak. This effect is usually amplified in 
the case of SDF when many constraints are imposed on 
the filter output. In order to locate the correlation peak 
reliably, it should be very localized [14]. However, it can 
be expected that the greater the localization of the filter 
response (approaching a 6 function) the more sensitive 
the filter to slight deviations from the patterns used in 
its synthesis. This suggests that the best response of the 
filter should not really be a 6 function, but some shape, 
like a Gaussian, whose dispersion can be tuned to the 
characteristics of the pattern space. In this section we 
will review the synthesis of such filters in the frequency 
domain [26]. 

Let us assume for the moment that there is no noise. 
The correlation of the i-th pattern with the filter h is 
represented by 



Zi(n) — <j>i(n) (g) h(n) 



0, 



1 



(27) 



where d is the dimension of the patterns. In the follow- 
ing capital letters are used to denote the Fourier trans- 
formed quantities. The filter is also required to produce 



an output Ui for each training image: 

Zi(0) = ui (28) 

which can be rewritten in the Fourier domain as: 



H + X = du 



(29) 



where + denotes complex conjugate transpose. Using 
Parseval's theorem, the energy of the i-th circulant cor- 
relation plane is given by: 



Ei 



d-l 

E 

n = 



Zi(n)f 



1 d-l 
k = 



1 d-l 



Mk)\ 2 = -,yjH(k)f\Mk)\ 2 



k=0 



( 3 °) 

When the signal is perturbed with noise the output 
of the filter will also be corrupted: 



z,-(0) = &(0)®/i(0) + A(0)®/i(0) 



(31) 



Under the assumption of zero-mean noise, the variance 
of the filter output due to noise is: 



£* = JZ>(*)i 25 (*) 



(32) 



k=0 



where S(k) is the noise spectral energy. What we would 
like is a filter whose average correlation energy over the 
different training images and noise is as low as possible 
while meeting the constraints on the filter outputs. A 
first choice is to minimize: 



E = J2(E'+ E n) 



(33) 



EE 

i k 



H(k)\\\Mk)\ 2 + S(k)) (34) 



subject to the constraints of eqn. (29). However, mini- 
mizing the average energy (or filter variance due to noise) 
does not minimize each term, corresponding to a partic- 
ular correlation energy (or noise variance). A more strin- 
gent bound can be obtained by considering the spectral 
envelope of the different terms in eqn. (34): 

E = J2 \H(k)\ 2 max^fc)! 2 , . . . , \$N(k)\ 2 , S(k)) (35) 

k 

If we introduce the diagonal 

matrix T kk = 7Vmax(|$i(&)| 2 , . . . , |$tv(&)| 2 , S(k)) the 
filter synthesis can be summarized as minimizing 

E = H+TH 



subject to 



H + X = du 



(36) 



(37) 



This problem can be solved [20] using the technique of 
Lagrange multipliers to minimize the function: 

TV 

S = H+TH - 2 Y^ MH+Xi - dui) (38) 

2 = 1 

where Ai, . . . , A^v are the parameters introduced to sat- 
isfy the constrained minimization. By zeroing the gra- 
dient of S with respect to H we can express J? as a 









Figure 5: Using an increasing amount of added white 
noise the emphasis given to the high frequency is reduced 
and the resulting filter response approaches that of the 
standard MSF. 



function of T and of A = {Ai, . . . , A^}. By substitution 
into eqn. (37) the following solution is found: 

H = T~ 1 X(X + T~ 1 X)~ 1 u (39) 

The use of the spectral envelope has the effect of reduc- 
ing the emphasis given by the filter to the high frequency 
content of the signal, thereby improving intraclass per- 
formance. It is important to note that the resulting filter 

1/9 

can be seen as a cascade of a whitening filter T ' and a 
conventional SDF based on the transformed data. Note 
that in this case the whitened spectrum is the envelope 
of the spectra of the real noise and of the training im- 
ages. A least square approach may again be preferred 
to cope with a large number of examples. In this case 
all available images are used to estimate T but only a 
subset of them is used to build the corresponding SDF. 
Experiments have been reported using a white noise of 
tunable energy a to model the intraclass variability [26] 



E 



k 



|#(&)| 2 max(|$i(&)| 2 



\$ N (k)\ 2 ,a) (40) 



Adding white noise limits the emphasis given to high fre- 
quencies, reducing the sharpness of the correlation peak 
and increasing the tolerance to small variations of the 
templates (see Figures 5 and 6). A comparison of dif- 
ferent filters is reported in Figure 7. The effects of non 
linear processing emphasizing the high frequencies to ob- 
tain a sharp correlation peak is reported in Figure 8. 

Another way of controlling the intraclass performance 
is that of modeling the correlation peak shape [8, 18]. As 
already mentioned, the shape of the correlation peak is 
expected to be important both for its detection and for 
the requirements imposed on the filter which can impair 
its ability to correlate well with patterns even slightly dif- 
ferent from the ones used in the training. Let us denote 
with F(k) the required shape of the correlation peak. 
The shape of the peak can be constrained by minimizing 



the squared deviations of its output from the required 
shape F: 



N d 



£* = ££!#(*)**•■(*) -wi 2 (4i) 



=i fc=i 



where, for instance, f{x) — exp(—x 2 /2a 2 ) is a Gaussian 
amplitude function. By switching to matrix notation, 
the resulting energy can be expressed as: 



E s 



H + DH + F + F - H + AF - F + A + H (42) 



where A is a diagonal matrix whose elements are the 
sum of the components of ^ and D is a diagonal ma- 
trix whose elements are the sum of the squares the com- 
ponents of <&i. The first term in the RHS of eqn. (42) 
corresponds to the average correlation energy of the dif- 
ferent patterns (see eqn. (30)). We suggest the use of 
the spectral envelope T instead of D, employed in the 
original approach, thereby minimizing the following en- 
ergy 



E f s = H+TH 



F + F-H + AF-F + A + H>E S (43) 



The minimization of Es subject to the constraints of eqn. 
(29) can be done again using the Lagrange multiplier and 
is found to be: 

H = T- 1 X(X + T- 1 X)- 1 du (44) 

+T _1 AF - T- 1 X(X + T- 1 X)- 1 X + T~ 1 AF 

These filters provide a controlled, sharp correlation peak 
subject to the constraints on the filter output, the re- 
quired correlation peak shape and the reduce variance 
to the noise. In our experiments the Fourier domain 
was used to compute the whitening filters. They were 
then transformed to the spatial domain where a stan- 
dard correlation was computed after their application. 
An approach using only computations in the space do- 
main can be found in [28]. 

5 Nonorthogonal Image Expansion and 
SDF 

In this section we review an alternative way of looking at 
the problem of obtaining sharp correlation peaks, namely 
the use of nonorthogonal image expansion [2, 3]. Match- 
ing by expansion is based on expanding the signal with 
respect to basis functions (BFs) that are all translated 
versions of the template. Such an expansion is feasible 
if the BFs are linearly independent and complete. It 
can be proven that self-similar BFs of compact support 
are independent and complete under very weak condi- 
tions. Suppose one wants to estimate the discrete d- 
dimensional signal g(x) by a linear combination of basis 
functions <j>i{x)\ 



9*( x ) = ^2 c iM x ) 



(45) 



where <f>i(x) now represents the i-th circulated transla- 
tion of <j). The coefficients are estimated by minimizing 




Figure 6: The output of the correlation with an SDF 
computed using the spectral envelope of 10 training im- 
ages and different amounts of white noise (left: a = 1, 
middle a = 5) compared to the output of normalized 
cross-correlation using one of the images used to build 
the SDF but without any spectral enhancement. The 
darker the image the higher the corresponding value. 
Note that an increased amount of white noise improves 
the response of the filter. 





Figure 8: Non linear processing can be employed. The 
figure represent the result of preprocessing the image 
to extract the local image contrast (intensity value over 
the average value in a small neighborhood). This kind 
of preprocessing emphasizes high frequencies and results 
in a sharp correlation peak. 



the square error of the approximation \g — g f \ 2 . The ap- 
proximation error is orthogonal to the basis functions so 
that the following system of equations must be solved: 



Y,<<t> 1 ,<t> j > 



d 

E 



< $d><l>j > c j 



< g,<i>i> 



<9><t>d> 



(46) 




Figure 7: The output of the correlation with an SDF 
computed using the spectral envelope of 20 training im- 
ages as whitening preprocessing. Left: the normal SDF 
(20 examples). Right: a least square SDF with 6 tem- 
plates (20 examples). The darker the image the higher 
the corresponding value. The least square SDF exhibits 
a sharper response using the same whitening filter. 



If the set of basis functions is linearly independent the 
equations give a unique solution for the expansion coef- 
ficients. If we consider the advanced SDF for the case of 
no-noise, single training image and working in the spa- 
tial domain [28], we have that the corresponding filter 
can be expressed as: 



h = (M T M)-V 



(47) 



where the columns of matrix [•] are the circulated basis 
functions <j>i. The output of the correlation is then given 
by: 

[<j>]h = c= [4> T ]~ V (48) 

The solution of the system (47) can be expressed as: 

c=[^ T ]-V (49) 

which is clearly the same. In the case of no noise the 
resulting expansion is c = (0, . . . , 0, 1, 0, . . .0) with a sin- 
gle 1 at the location of the signal. The idea of expansion 
matching is also closely related to correlation SDFs [6] 
where multiple shifted templates were explicitly used to 
shape the correlation peak. Let us consider a set of tem- 
plates obtained by shifting the original pattern (possibly 
with circulation) on the regular grid defined by the im- 
age coordinates. We can require that the correlation 
value of the original pattern with its shifted versions be 
1 when there is no shift and for every non null shift. 
This corresponds to a filter whose response is given by 
c— (0, . . . , 0, 1, 0, . . .0) as previously described. 

6 Other projection approaches 

The whole idea of projection Synthetic Discriminant 
Functions is to find a direction onto which the pro- 





(a) 



(b) 



Figure 9: Computing the distance from linear subspace 
(a) versus computing the distance from a single proto- 
type (b). In drawing (a), vector < <j> > represents the 
average pattern and the horizontal line on which <p L lies 
represents the linear subspace. <p L is the projection of 
pattern <fi on the linear space and <p R is the projection 
residual. Drawing (b) shows two vectors <f>i and <p 2 with 
the same distance from the average pattern A but dif- 
ferent distances from the pattern space. 



jections of the different signals have predefined val- 
ues. A typical image with 256 x 256 pixels is pro- 
jected, for recognition purposes, onto a single direction 
in this high dimensional space. Another approach is to 
project the signal to be recognized onto a linear subspace 
[32, 27, 22, 29, 30, 31]. Let us first assume that the pat- 
terns of each of the classes to be discriminated belongs to 
different linear subspaces. For each class it is then pos- 
sible to determine an orthogonal transformation which 
diagonalizes the covariance matrix. The elements of the 
transformed basis are the eigenvectors of the covariance 
matrix and are called principal components. They can 
be sorted by decreasing contribution to the covariance 
matrix, as represented by the corresponding entry in the 
diagonal covariance matrix [12]. The number of vectors 
in the basis is equal to the minimum between the number 
of available class pattern and the dimensionality of the 
embedding space. Each pattern in the class can usually 
be described by using only the most important compo- 
nents. The resulting restricted basis spans a linear sub- 
space in which the patterns of the represented class can 
be found. Each possible pattern <fi can be projected onto 
the set of principal components and can be described as 
the sum of its projection <p L . plus an orthogonal residual 

<l>Ri : 

<f> = <l> Li + <l> Ri + < <l>i > (50) 

where i identifies the class and < <j>i > is the correspond- 
ing centroid. A comparison with the usual technique of 
computing the distance from a single pattern (e.g. the 
centroid) is reported in Figure 9. 

An important class of objects spanning a linear space 
is given by the ortographic projections of rigid sets of 
points when looked at from different positionsfl, 23]. 
Different objects span different 6-dimensional linear 
spaces. This can be used to recognize them, irrespective 
of their orientation in space, by computing the magni- 
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Figure 10: The square of coordinates ij represents the 
average value of distances of views of the 2-th and 
j-th clip (darker values represent smaller distances). 
LEFT: euclidean distances of views of the different clips; 
RIGHT: distances computed using the learned metric 
W?W 
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Figure 11: Eigenvalues of the learned metric matrix 
W T W. Note that there is compatibility with the find- 
ings of Basri-Ullman that under ortographic projection 
the rank of the metric is 6. 



tude of the projection residual over the individual lin- 
ear spaces (see Figure 9). Under perspective projection, 
when viewing an object from a reasonable distance, we 
expect that a 6-dimensional linear space can still provide 
a good approximation to the real manifold. Further anal- 
ysis of the recognition experiments reported in [5] has 
shown that an HyperBF network [25] with a single unit 
is in fact able to learn the approximating linear space 
from a set of example views of different objects. The 
experiments used paper clips characterized by 6 feature 
points in the image plane, resulting in 12-dimensional 
vectors after perspective projection. The 2-th clip was 
characterized by a one-unit HyperBF network: 



C,-(^) = exp{-{ct> - ti) T W?Wi(<t> - «,-)) 



(51) 



where <j> is the 12-dimensional input to the network, t 2 - is 
a sample view (a prototype) of the 2-th clip and WfWi 
represents a metric. The network is trained by modifying 
ti and WjWi to obtain Ci(</>) ~ 1 when <fi is a view of 
the i-th clip and C {(</>) ~ when it is not. The effects 
of the resulting metric WfWi on the computation of 
distances between different views of the clips can be seen 



in Figure 10. The distance computed using the learned 
metric is effectively the size of the projection residual. 
The eigenvalues of WfWi (see Figure 11) are compatible 
with a 6-dimensional embedding of the pattern space. 

If the linear subspace is the one spanned by the first 
k eigenvectors of the covariance matrix, the sum of the 
eigenvalues corresponding to the ignored components 
can be used as an estimate for \<fiji\ when the pattern 
belongs to the given class. In particular it can be used 
to accept or reject the pattern according to a threshold 
on the size of the residual 



where 



\<I> R \ 2 <6 

i>k 



(52) 



(53) 



and (3 > 1 is an heuristic factor taking into account how 
good an estimate ^2 i>k A z - is of the residual error. In Fig- 
ure 12 we report the fraction of image pixels classified 
as right eye as a function of the threshold on the resid- 
ual. The fact that the residual is small (compared to 
6) does not imply that the pattern belongs to the given 
class. Thresholding on the residual error should then 
be supplemented by the use of classification techniques 
in each of the linear subspaces, taking into account the 
distributions of the patterns. If, for instance, the distri- 
bution of the points in the linear subspace is Gaussian, 
the parameters of the distribution can be computed and 
the probability of a pattern with given coordinates esti- 
mated (see Figure 13 where an example is reported). If 
we denote by X{ the i-th component of <j> the following 
relation holds if the distribution in the feature space is 
Gaussian: 

V oc n 8 -e-^/( 2A ') (54) 

where n is the number of patterns used for computing 
the principal components. The resulting map can be 
used in conjunction with the distance map (see Figure 
14) to establish if a pattern of the correct class is present 
(for a similar approach, using the distance from the cen- 
troid in the projection space [29, 30, 31]). Note that in 
this particular case the probability map is much more 
effective than the residual distance map. 

It could be that a class cannot be packed tightly into 
a linear subspace. A possible improvement is to attempt 
local expansions [27, 22, 31]. Points can be clustered, and 
for each of the resulting clusters a principal component 
analysis can be attempted. The previous reasonings can 
be applied and the class is represented by a set of linear 
subspaces. A nice application of this approach can be 
found in [22] where the space spanned by faces is first 
clustered into views corresponding to different poses and 
the resulting clusters are then described by the most 
important principal components. 

7 An experimental comparison 

In order to clarify the practical relevance and the relative 
merits of the previous template-matching techniques, it 
is useful to compare them on a single task. We choose to 
assess the performance of the different techniques on the 
problem of locating eyes in frontal images of faces. This 
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Figure 12: The fraction of image pixels classified as right 
eye as a function of the threshold on the residual. The 
first 10 eigenvectors from a population of 60 images were 
used. The image used for the plot was of a person not 
in the database. The vertical line represent the thresh- 
old computed by summing the residual eigenvalues. The 
correct eye is the only selected region for d < 11, the 
other eye being selected next. 



Distribution of the 5-th KL component 




Figure 13: The distribution of the values of the 5-th prin- 
cipal component computed from 60 eyes images. Note 
the clear unimodality of the distribution which suggests 
the effectiveness of using a quadratic classifier in the fea- 
ture subspace. The other components present a similar 
distribution. 




External distance 
S.ph_ t __internal dis. 
Internal dist . 
Combined 




10 15 



Figure 14: The map of the residual size (left) and of the 
projection probability (right). Note how the probability 
is low in regions where the reconstruction is good. The 
darker the value the lower the distance and the higher 
the probability. 
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Figure 15: Distribution of the distance values orthogonal 
to the feature space when projecting onto the first 10 
eigenvectors. 



Distribution of internal distances 
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Figure 16: Distribution of the distance values within the 
feature space when projecting onto the first 10 eigenvec- 
tors. 



Figure 17: Performance of different strategies based on 
the computation of principal components. The hori- 
zontal axis reports the number of components used in 
the expansion, while the vertical axis reports the per- 
centage of eyes correctly located (see text for a detailed 
explanation). 



Least Squares SDF performance 
1 .0 I ■ ... , ■ ... , ■ ... , ■ ... ■ a w = 

o w = 1 



0.£ 



§ °" 6 
E 

° 0.4 

CD 



0.2 



0.0 




A W = 10 

w = 100 
•fr correlation 



5 10 15 

Templates 



20 



Figure 18: Performance of least squares SDF with dif- 
ferent amount of regularizing noise. Correlation perfor- 
mance is also reported using the average template and 
the whole set of available templates. The horizontal 
axis reports the number of patterns used in building the 
filters, while the vertical axis reports the percentage of 
eyes correctly located (see text for a detailed explana- 
tion). 
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is a preliminary step for identifying the represented per- 
son by comparing his/her image to a reference database. 
The available database consisted of 180 images (three 
images, taken at different time, from sixty different peo- 
ple). The eyes were manually located and images nor- 
malized by fixing the position of the eyes to standard 
values. The resulting normalized images (with an inte- 
rocular distance of 28 pixels) were used for the experi- 
ments. Three different disjoint subsets, each consisting 
of the images from 18 different people were used in turn 
for building the SDFs, IsSDF and KL expansions. Per- 
formance was then assessed on the remaining images. 

For each of the compared strategies and testing im- 
ages, a map was computed reporting at each pixel the 
absolute difference of the computed values (residual, cor- 
relation, etc.) from the required values at the pattern 
(i.e. eye) location (e.g. for the residual, 1 for corre- 
lation). The resulting maps could then be considered 
as distance maps. For each image we masked in turn 
the region of the left and right eye. The unmasked eye 
was considered to be located correctly if the smallest dis- 
tance value was within 8 pixels from the correct location 
(manually detected). 

As far as the SDFs and IsSDF are concerned, a single 
image from the represented persons was used in building 
the filters, while the computation of the KL components 
relied on all the available images in the training subset. 
For all of the techniques the test was run on 120 images. 
Both left and right eyes were used in building the filters 
and the expansions. In order to assess the performance 
of the techniques, each image was used to locate both 
eyes by masking in turn the left and right eye region 
when looking for the maximum/ minimum values ideally 
associated to the template location. 

Several variants of the KL approach have been in- 
vestigated using distances from and within the feature 
space. The external distance d e is simply the error in the 
reconstruction of the pattern using the restricted eigen- 
vector basis (see eqn. 50). The internal distance di is 
the distance computed within the linear subspace from 
its origin (the centroid of the patterns). The spherical 
internal distance d Si is the Mahalanobis distance in the 
linear subspace. 

Let us assume that the orthogonal vectors defining 
the linear pattern subspace are known or computed reli- 
ably from a subset of the available examples. We could 
then estimate the Mahalanobis distance by computing 
the variance for each of the (uncorrelated) coordinates 
using all the available samples. Some of the examples 
could be erroneous or atypical and would probably lead 
to an overestimated variance. In order to overcome 
this potential problem, a robust estimate of the scale 
parameter of each coordinate was computed using the 
tan-h M-estimators introduced by Hampel [13]. Finally 
the combined distance (see also the related approach in 
[29, 30, 31]) was computed by the following relation: 



max l — , — I (55) 

^sO "eO/ 

where the normalizing factors d S Q and d e o define the 
points at which the cumulative distribution of the in- 
ternal and external distances reaches 99% (see Figures 
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15 and 16). The performance of the different variants 
are reported in Figure 17. 

SDFs and IsSDF were built using different amounts 
of regularizing noise (using eqn. (40)) and of templates. 
The resulting performance, together with the perfor- 
mance of standard correlation is reported in Fig. 18. It 
is interesting to note the major impact of the regulariz- 
ing noise on the performance of this technique. However, 
the bias and variance of the filter responses on the test 
images are not related to the filter performance. 

The combined distance d c is the best among the com- 
pared strategies. Its decline in performance with in- 
creasing dimensionality of the expansion basis is linked 
to the trend of the external distance performance. By 
using more and more eigenvectors we allow for good 
reconstruction of patterns different from eyes. At the 
same time the scaling factor computed from the dis- 
tance distribution on the training samples becomes very 
small (should we use all of the computed eigenvectors 
the samples could be reconstructed exactly). Therefore 
the external distance is the (wrong) dominating factor 
in eqn. (55). A more sophisticated integration is pre- 
sented in [31]. The performance of the template match- 
ing strategies based on KL expansions is consistently 
higher than the one achieved by SDFs in the reported 
variants. Also, expanding a pattern onto an appropri- 
ate basis seems to provide reliable template matching to 
patterns which span a manifold which can be approxi- 
mated well (at least locally) by a linear (tangent) space 
[1,24,23,5]. 

The next section will introduce non linear machinery 
(sigmoidal and Gaussian network) for the purposes of 
pattern description and classification. 

8 Future Directions: Learning and SDF 

The description of the advanced SDFs has shown that 
that they can be considered as standard SDF working 
on a preprocessed signal. The characteristics of the orig- 
inal signal and noise are used in the synthesis of the 
preprocessing filter to achieve optimal sharpness in the 
correlator response. If we look at the patterns in the 
transformed space, the correlator output is a weighted 
average of the correlation with a set of examples: 

<f>Xx)®hXx)= J2 b i<f>X x ) ® <f>i( x ) ( 56 ) 

i = 1 , . . . , n 

where the prime refers to the transformed space. The 
patterns <pi can be randomly chosen among the avail- 
able examples or selected according to particular criteria. 
A possible strategy is to synthesize the filter incremen- 
tally: the response of the filter on all the training images 
not yet used to build the filter is computed and if the 
worst filter response is not acceptable the corresponding 
image is added to the building set and a new filter is 
computed [7]. The construction of the filter, apart from 
the phase of selecting meaningful training images is lin- 
ear. An improvement is expected with the introduction 
of nonlinearity in the filter design. We propose the use of 
approximation networks [25] to build general non linear 
filters which are able to discriminate patterns of differ- 
ent classes while giving the same response on patterns 



of the same class. These filters can be considered as a 
generalization of the projections Synthetic Discriminant 
Functions. They are built using a set of training images 
and a set of soft (i.e. not exactly met) constraints on the 
filter output. 

The general structure of the network is reported in 
Figure 19. The units of the first level represent sigmoidal 
(comparison by projection) or Gaussians (comparison by 
distance) functions: 



Oli(<j>) 



exp{-{(j)-t i ) T W T W{(j)- 
<r{cj> • t { + /?) 



*.0) 



Gaussian 
sigmoidal 
(57) 
In both cases, the system is able to mask regions of the 
templates which are not useful for obtaining the required 
output values 2 . The first level of the network can be seen 
as computing some "optimal templates" against which 
the input signal is to be compared. The output of the 
second level is computed as: 



J2 b J o ^(°i) 



(58) 



and the function implemented by unit c>2j can be of 
the Gaussian or sigmoidal type independently from the 
choice of the first layer units. The second level computes 
a non linear mapping of the projections (or distances) of 
the signal by minimizing the square error of the net- 
work on the mapping constraints (soft, as they are not 
met exactly). In some sense, the network triangulates 
the position of the input signal in pattern space using 
the distances from automatically selected reference tem- 
plates. The resulting networks have a very high number 
of free parameters and their training presents difficulties. 
Among them two are of particular concern: overfitting 
and training time in a high dimensional space. A way 
of coping with the first one is that of cross-validation 
[12]: the network undergoes training as long as its per- 
formance on a test set improves. We propose to reduce 
the effects of high dimensionality by using a hierarchy 
of networks of similar structure but working at increas- 
ing resolution. The network with the lowest resolution 
is trained first and extensively. The next network in the 
hierarchy is then initialized by suitably mapping the pa- 
rameters of the previous one. Note that only the first 
level needs to be modified structurally. A reduced train- 
ing time is expected. The procedure is iterated at all the 
levels of the hierarchy. A side effect of the hierarchical 
training is to provide fully trained networks for different 
resolutions, enabling a hierarchical approach to template 
matching. The preprocessing stage of the network is the 
one computed for the synthesis of the ASDF. The op- 
timal templates used by the first layer of the network 
can also be initialized using the building patterns of the 
linear filter. 



This is achieved during the training phase by modifica- 
tion of the entries of the matrix W , if Gaussians are used, or t % 
if sigmoids are used. Relatively small values give low weight 
to the differences of the corresponding coordinates, thereby 
making the system output weakly dependent on them. 




II 



Figure 19: An approximation network for template 
matching. The preprocessing stage applies simple trans- 
formation to the input pattern (e.g. to emphasize high 
frequency components). 



9 Conclusions 

Several approaches to template matching have been re- 
viewed and compared on a common task. A new vari- 
ant of Synthetic Discriminant Functions, based on least 
square estimation, was introduced. Several template 
matching techniques based on the expansion of pat- 
terns on principal components have been reviewed. A 
simple way of integrating internal/external distances 
within/from a linear feature space was also proposed. 
Several of the techniques mentioned in the paper have 
been compared on a common task: locating eyes in 
frontal images of different people. The techniques based 
on pattern expansion provide superior performance, at 
least in the particular task considered. Finally, a two 
layer approximation network has been proposed to gen- 
eralize the structure of SDF to a nonlinear filter. Future 
work will explore the advantages and difficulties of the 
introduction of nonlinearity. 
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